INTRODUCTION

The support vector machine (SVM) has played an 
important role in bringing certain themes to the fore 
in computationally oriented statistics. However, it is 
important to place the SVM in context as but one 
member of a class of closely related algorithms for 
nonlinear classification. As we discuss, several of the 
"open problems" identified by the authors have in 
fact been the subject of a significant literature, a 
literature that may have been missed because it has 
been aimed not only at the SVM but at a broader 
family of algorithms. Keeping the broader class of al- 
gorithms in mind also helps to make clear that the 
SVM involves certain specific algorithmic choices, 
some of which have favorable consequences and oth- 
ers of which have unfavorable consequences — both 
in theory and in practice. The broader context helps 
to clarify the ties of the SVM to the surrounding 
statistical literature. 

We have at least two broader contexts in mind for 
the SVM. The first is the family of "large-margin" 
classification algorithms — a class that includes boost- 
ing and logistic regression. All of these algorithms 
involve the minimization of a convex contrast or 
loss function that upper bounds the 0-1 loss func- 
tion. The SVM makes a specific choice of convex loss 
function — the so-called hinge loss. Hinge loss has 
some potentially desirable properties (e.g., sparse- 
ness) and some potentially undesirable properties 
(e.g., lack of calibration to posterior probabilities).
As we discuss, much of the theoretical analysis of 
the SVM is best carried out by focusing on convex- 
ity and abstracting away from the details of specific 
loss functions. 

Second, as the authors note, the SVM is an in- 
stance of the broader family of statistical proce- 
dures based on reproducing kernel Hilbert spaces 
(RKHSs). The authors' emphasis is on the use of 
RKHS methods to provide basis expansions for dis- 
criminant functions and regression functions. RKHS 
ideas have, however, been carried significantly fur- 
ther in recent years, enlivening areas of computa- 
tionally oriented statistics beyond classification and 
regression. We wish to convey some of the reasons 
for this broader interest in RKHS-based approaches. 

There are both computational and statistical mo- 
tivations for focusing on methods based on convex- 
ity and reproducing kernel Hilbert spaces. In the 
remainder of this discussion we attempt to disen- 
tangle some of these motivations, but we wish to 
emphasize at the outset that it is precisely because 
these methods bring computational and statistical 
considerations together that they are so interesting. 

CONVEXITY 

The SVM is one example of a general strategy
for solving the binary classification problem via a
"convex surrogate loss function." To develop this
perspective, let us define binary classification as the
problem of choosing a discriminant function
$f : \mathcal{X} \to \mathbb{R}$ that minimizes
misclassification risk,

$$R(f) = P(Y \neq \operatorname{sgn}(f(X))) = E\,\ell(Y f(X)),$$

where $X \in \mathcal{X}$ is the covariate, $Y \in \{\pm 1\}$ is the
binary response, and $\ell(\alpha) = 1$ for $\alpha < 0$ and
$\ell(\alpha) = 0$ otherwise. The family of large-margin
classification algorithms attacks this problem indirectly
by minimizing a quantity known as the $\phi$-risk,

$$R_\phi(f) = E\,\phi(Y f(X)),$$

where $\phi : \mathbb{R} \to \mathbb{R}$ is a surrogate for the loss function $\ell$,
and $Y f(X)$ is called the margin of $f$ on the observation
$(X, Y)$. The margin indicates not only whether
the observation is correctly classified by $f$, but how
close $f$ comes to choosing the opposite label. The
surrogate loss function $\phi$ is chosen so that large
margins correspond to small losses.

Given a data set $(X_1, Y_1), \ldots, (X_n, Y_n)$, we can
form the empirical $\phi$-risk

$$\hat{R}_\phi(f) = \frac{1}{n} \sum_{i=1}^{n} \phi(Y_i f(X_i))$$

and attempt to minimize this quantity with respect
to the discriminant function $f$. When $\phi$ is chosen as
a convex function and $f$ is constrained to lie in a
convex family of prediction rules, the minimization
becomes a convex optimization problem. Contrast
this with minimization of empirical 0-1
(misclassification) risk,

$$\min_{f \in \mathcal{F}} \hat{R}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(Y_i f(X_i)),$$

a problem whose exact or even approximate solution
is known to be intractable for most nontrivial
function classes $\mathcal{F}$ (e.g., Arora, Babai, Stern and
Sweedyk, 1997).

In the case of SVMs, the convex surrogate is the
hinge loss

$$\phi(\alpha) = \begin{cases} 1 - \alpha, & \text{if } \alpha < 1, \\ 0, & \text{otherwise,} \end{cases}$$

and the function $f$ is chosen from the RKHS $\mathcal{H}$
defined by the kernel. However, the hinge loss is not the
only convex surrogate worth considering. Using the
binomial deviance function as a convex surrogate
and optimizing over linear functions on $\mathbb{R}^p$ yields
logistic regression. Just as with the SVM, a nonlinear
version of logistic regression can be defined by
making use of reproducing kernels (Zhu and Hastie,
2005). AdaBoost (Schapire and Singer, 1999) can
also be interpreted as a large-margin method, with
$\phi(\alpha) = e^{-\alpha}$; similar greedy ensemble methods
correspond to other choices of $\phi$.
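As a small illustration of the surrogates named above, the following sketch evaluates the hinge, binomial deviance and exponential losses on a grid of margins and checks that each one upper bounds the 0-1 loss. The scaling of the deviance by $1/\log 2$ (so that it equals 1 at margin zero) and the grid itself are assumptions made for this example, not choices taken from the text.

```python
import math

def zero_one(a):
    # 0-1 loss on the margin a = y * f(x): 1 if a < 0, else 0
    return 1.0 if a < 0 else 0.0

def hinge(a):
    # SVM surrogate: 1 - a for a < 1, and 0 otherwise
    return max(0.0, 1.0 - a)

def logistic(a):
    # binomial deviance, scaled by 1/log(2) (an assumption of this sketch)
    # so that it equals 1 at a = 0 and dominates the 0-1 loss
    return math.log1p(math.exp(-a)) / math.log(2.0)

def exponential(a):
    # AdaBoost surrogate
    return math.exp(-a)

# illustrative grid of margins in [-3, 3]
margins = [i / 10.0 for i in range(-30, 31)]
surrogates = {"hinge": hinge, "logistic": logistic, "exponential": exponential}

# for each surrogate, check phi(a) >= 0-1 loss at every grid point
dominates = {
    name: all(phi(a) >= zero_one(a) for a in margins)
    for name, phi in surrogates.items()
}
print(dominates)
```

All three functions are convex and decreasing in the margin, which is what makes large margins correspond to small losses.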

The benefits of empirical convex risk minimization 
are not just computational. Searching for a predic- 
tion rule which achieves a large margin on many 
training examples, rather than just correctly classi- 
fying them, is an implicit form of regularization. For 
example, certain function classes of infinite Vapnik- 
Chervonenkis (VC) dimension, where empirical 0- 
1 risk minimization does not yield good classifiers, 
can be used effectively in the large-margin frame- 
work (Bartlett, 1998; Schapire, Freund, Bartlett and 
Lee, 1998). 



Taking the margin-based viewpoint highlights the
important role convexity plays in the success of SVMs.
On the other hand, the authors' emphasis on the
need to find a differentiable or smooth formulation
seems misplaced. In the differentiable case, the key
property of a convex objective function $f$ is that, for
any two points $x, y$ in the domain,

$$f(y) \ge f(x) + \nabla f(x)^T (y - x). \tag{1}$$

Thus, local behavior of $f$ (its gradient at $x$)
determines a global lower bound on $f$. The existence of
this bound makes possible efficient algorithms for
convex optimization. However, property (1) holds
in a slightly generalized form even for nondifferentiable
convex functions. A subgradient of a convex
$f$ at $x$ is a vector $g$ such that

$$f(y) \ge f(x) + g^T (y - x) \quad \forall y.$$

The subdifferential $\partial f(x)$ of $f$ at $x$ is the set of $f$'s
subgradients at $x$. The subdifferential is the natural
analog of the gradient for nonsmooth objectives:
any point in $\partial f(x)$ provides the equivalent of
property (1); $0 \in \partial f(x)$ if and only if $x$ is a global
minimizer of $f$; and the subdifferential contains only the
gradient at points of differentiability. Moreover, every
convex function has nonempty subdifferentials
throughout the interior of its domain. There has
been a great deal of successful research on efficient
algorithms for nonsmooth convex optimization using
"bundle methods" based on subdifferentials. See,
for example, Hiriart-Urruty and Lemarechal (1993),
Borwein and Lewis (2000) and Boyd and Vandenberghe
(2004).
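To make the role of subgradients concrete, here is a minimal sketch of a plain subgradient method applied to the empirical hinge risk of a linear discriminant $f(x) = w^T x$ with a squared-norm penalty. This is a simpler relative of the bundle methods cited above, not an implementation of them; the synthetic data, the penalty weight `lam`, the step-size rule and all function names are assumptions made for illustration.

```python
import random

def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def objective(w, data, lam):
    # empirical hinge risk plus lam * ||w||^2
    risk = sum(max(0.0, 1.0 - y * dot(w, x)) for x, y in data) / len(data)
    return risk + lam * dot(w, w)

def subgradient(w, data, lam):
    # One element of the subdifferential of the objective at w.
    g = [2.0 * lam * wj for wj in w]
    for x, y in data:
        if y * dot(w, x) < 1.0:  # hinge term is active; at the kink, 0 is a valid choice
            g = [gj - y * xj / len(data) for gj, xj in zip(g, x)]
    return g

def subgradient_method(data, lam=0.1, steps=300):
    w = [0.0] * len(data[0][0])
    best_w, best_val = list(w), objective(w, data, lam)
    for t in range(1, steps + 1):
        g = subgradient(w, data, lam)
        step = 0.5 / t ** 0.5  # diminishing step size (illustrative choice)
        w = [wj - step * gj for wj, gj in zip(w, g)]
        val = objective(w, data, lam)
        if val < best_val:  # not a descent method, so keep the best iterate
            best_w, best_val = list(w), val
    return best_w, best_val

# synthetic two-class data: class y is centred at (y, y)
rng = random.Random(0)
data = []
for _ in range(40):
    y = rng.choice([-1, 1])
    data.append(([y + rng.gauss(0, 0.5), y + rng.gauss(0, 0.5)], y))

w_hat, value = subgradient_method(data)
print(w_hat, value)
```

The objective is nondifferentiable wherever some margin equals 1, yet the iteration needs nothing more than one subgradient per step.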

Statistical Analysis 

To address issues such as statistical consistency
and finite sample behavior in the large-margin
framework, we need to decompose the risk $R(f_n)$ into
three components. Two of the components are the
approximation error and the estimation error that
are familiar from other areas of nonparametric
statistics. The third component arises from the use of the
convex surrogate $\phi$ in place of the 0-1 loss.

We can quantify the effect of this third component
through an inequality of the form

$$\psi(R(f) - R^*) \le R_\phi(f) - R_\phi^*,$$

where $\psi$ is a convex, nonnegative function and $f$ is
an arbitrary measurable function (Bartlett, Jordan
and McAuliffe, 2006). Notice that such an inequality
relates the excess risk, $R(f) - R^*$, to the excess
$\phi$-risk, $R_\phi(f) - R_\phi^*$. Here, the Bayes risk, $R^*$, is
defined by $R^* = \inf_g R(g)$, where the infimum is over
all measurable functions, and $R_\phi^*$ is the minimal
$\phi$-risk, $R_\phi^* = \inf_g R_\phi(g)$. An optimal inequality of this
form has been obtained for any nonnegative surrogate
loss function $\phi$, where $\psi$ can be written explicitly
in terms of $\phi$ (Bartlett, Jordan and McAuliffe,
2006). In the case of SVMs, where $\phi$ is the hinge
loss, $\psi$ turns out to be the identity (Zhang 2004;
Blanchard, Bousquet and Massart, 2006).
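The claim that $\psi$ is the identity for the hinge loss can be checked numerically. The sketch below assumes the closed form $\psi(\theta) = \phi(0) - H((1+\theta)/2)$, with $H(\eta) = \inf_\alpha \{\eta\,\phi(\alpha) + (1-\eta)\,\phi(-\alpha)\}$, which is how we recall the result of Bartlett, Jordan and McAuliffe (2006) for convex, classification-calibrated $\phi$; the grid approximation of the infimum and the function names are illustrative assumptions.

```python
import math

def hinge(a):
    return max(0.0, 1.0 - a)

def exponential(a):
    return math.exp(-a)

# grid on [-5, 5] used to approximate the infimum over alpha
ALPHAS = [i / 200.0 for i in range(-1000, 1001)]

def optimal_conditional_risk(phi, eta):
    # H(eta) = inf_alpha { eta * phi(alpha) + (1 - eta) * phi(-alpha) }
    return min(eta * phi(a) + (1.0 - eta) * phi(-a) for a in ALPHAS)

def psi(phi, theta):
    # assumed closed form for convex, classification-calibrated phi
    return phi(0.0) - optimal_conditional_risk(phi, (1.0 + theta) / 2.0)

for theta in (0.1, 0.4, 0.8):
    print(theta, psi(hinge, theta), psi(exponential, theta))
```

For the hinge loss the printed value should match $\theta$ itself. For the exponential loss it should be close to $1 - \sqrt{1 - \theta^2}$, which is smaller than $\theta$, so the resulting bound on the excess risk is weaker.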

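Thus, for SVMs we can write an optimal upper
bound on the excess risk as

$$R(f_n) - R^* \le R_\phi(f_n) - R_\phi^* = \Bigl( R_\phi(f_n) - \inf_{f \in \mathcal{F}} R_\phi(f) \Bigr) + \Bigl( \inf_{f \in \mathcal{F}} R_\phi(f) - R_\phi^* \Bigr).$$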

This decomposition into the estimation error and
the approximation error reflects the bias-variance
trade-off, ubiquitous in nonparametric estimation.
On the one hand, the function $f_n$ must come from a
suitably simple class to ensure that the performance
of $f_n$ on the finite training sample is representative
of its true performance and so the estimation error
$R_\phi(f_n) - \inf_{f \in \mathcal{F}} R_\phi(f)$ is not too large. On the other
hand, $f_n$ must be suitably complex, so that its
$\phi$-risk is not too much larger than the optimal $R_\phi^*$. One
common approach to this model selection problem is
the method of sieves, where the function $f_n$ is chosen
as the minimizer of the empirical $\phi$-risk over classes
$\mathcal{F}_n$ that grow progressively richer as the sample size
$n$ increases. The other common approach is to add
a regularization term, so that $f_n$ is chosen as the
minimizer of

$$\hat{R}_\phi(f) + \lambda_n \Omega(f)$$

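for some regularization functional $\Omega$ that penalizes
complex functions and for some regularization
coefficients $\lambda_n$. In the case of SVMs, the regularization
functional is the squared norm in the reproducing
kernel Hilbert space $\mathcal{H}$. The regularization approach
and the method of sieves are closely related. In
particular, since $0 \in \mathcal{H}$ and $\hat{R}_\phi(0) = 1$ for the hinge loss,
we know that

$$\|f_n\|_{\mathcal{H}}^2 \le \frac{1}{\lambda_n} \bigl( \hat{R}_\phi(f_n) + \lambda_n \|f_n\|_{\mathcal{H}}^2 \bigr) \le \frac{1}{\lambda_n} \bigl( \hat{R}_\phi(0) + \lambda_n \|0\|_{\mathcal{H}}^2 \bigr) = \frac{1}{\lambda_n}.$$

Thus, the SVM chooses a function from the sieve

$$\mathcal{F}_n = \{ f \in \mathcal{H} : \|f\|_{\mathcal{H}}^2 \le 1/\lambda_n \}.$$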
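The norm bound behind this sieve can be observed on a toy problem. The sketch below takes a one-dimensional linear discriminant $f(x) = wx$, so that the squared RKHS norm is simply $w^2$, minimizes the regularized empirical hinge risk by ternary search (valid because the objective is convex), and checks that $w^2 \le 1/\lambda$. The five data points, the search interval and the values of `lam` are made up for illustration.

```python
def regularized_risk(w, data, lam):
    # empirical hinge risk of f(x) = w * x plus lam * w^2
    hinge = sum(max(0.0, 1.0 - y * w * x) for x, y in data) / len(data)
    return hinge + lam * w * w

def minimize_convex_1d(g, lo, hi, iters=200):
    # Ternary search; valid because g is convex on [lo, hi].
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if g(m1) <= g(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

# hypothetical (x, y) pairs; the last point is deliberately on the wrong side
data = [(0.5, 1), (1.5, 1), (-1.0, -1), (-0.2, -1), (0.3, -1)]

results = {}
for lam in (0.01, 0.1, 1.0):
    w_hat = minimize_convex_1d(
        lambda w: regularized_risk(w, data, lam), -1000.0, 1000.0
    )
    # record the minimizer and whether it lies in the sieve {w : w^2 <= 1/lam}
    results[lam] = (w_hat, w_hat ** 2 <= 1.0 / lam)

print(results)
```

The bound uses only the fact that the minimizer does at least as well as the zero function, so it holds for any sample.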

It is important to note that the regularization
coefficient $\lambda_n$ must decrease with the sample size, so
as to ensure universal consistency. Indeed, the
approximation error, $\inf_{f \in \mathcal{F}_n} R_\phi(f) - R_\phi^*$, must go to
zero. On the other hand, it should decrease
sufficiently slowly that the estimation error,
$R_\phi(f_n) - \inf_{f \in \mathcal{F}_n} R_\phi(f)$, also decreases to zero. (Consistency
does not follow from uniform convergence of the
empirical to expected error in a ball of fixed radius, as
the appendix of the paper suggests.)

One of the most important consequences of the 
choice of a kernel is the way it affects this trade-off 
between the estimation and approximation errors. 
We can view the kernel, and hence the norm in the 
RKHS, as defining a complexity hierarchy. A good 
kernel is one for which a good approximation to $R_\phi^*$
can be obtained with a function that is low in the 
complexity hierarchy, in the sense that it has a small 
norm. 

The paper lists several statistical issues as
important open problems, notably the finite-sample
performance of SVMs and the estimation of a posteriori
class probabilities. In fact, bounds on the
estimation error, $R_\phi(f_n) - \inf_{f \in \mathcal{F}_n} R_\phi(f)$, for finite
sample size have been known for years. These bounds
are expressed in terms of properties of the
eigenvalues of the Gram matrix, the matrix whose
entries are $k(x_i, x_j)$ (see, e.g., Shawe-Taylor, Bartlett,
Williamson and Anthony, 1998; Bartlett and
Shawe-Taylor, 1999; Williamson, Smola and Scholkopf, 1999;
Mendelson, 2002; Bartlett, Bousquet and Mendelson,
2005; Blanchard, Bousquet and Massart, 2006).
Moreover, these results provide an essential founda- 
tion for proofs of consistency (Steinwart, 2002, 2005; 
Zhang, 2004). In contrast, while the VC dimension 
is central to the analysis of methods that minimize 
the empirical 0-1 risk, it is not relevant to SVMs. In- 
deed, for any kernel that is sufficiently rich to allow 
a universal consistency result, the RKHS necessarily 
has infinite VC dimension. See Bartlett and Shawe- 
Taylor (1999) for an explanation of the role of the 
VC dimension and other complexity measures that 
are more appropriate to the analysis of the finite- 
sample behavior of SVMs. 

Moreover, the problem of estimation of a posteriori
class probabilities using SVM classification outputs
is not open, in the following sense. It is an easy
calculation to see that, for any $x$, the minimizer of
the conditional expectation $E[\phi(Y f(x)) \mid X = x]$ is
$f^*(x) = \operatorname{sign}(2\eta(x) - 1)$, where $\eta(x) = \Pr[Y = 1 \mid X = x]$.
Furthermore, results of Steinwart (2003) establish
that, under reasonable conditions on the kernel
function and the rate at which the regularization
coefficients go to zero, the function $f_n$ chosen by the
SVM converges to $f^*$. Thus, asymptotically there is
no information about the a posteriori class
probabilities in the SVM classification outputs, and the
SVM framework does not appear to provide an
appropriate starting point for the estimation of a
posteriori probabilities. This is a distinguishing feature
of the hinge loss $\phi$: its minimization cannot
correspond to fitting a probability model, since it is
indifferent to distinct values of the class probability.

To elaborate on this point, note that one of the 
attractive features of SVM classifiers is their sparse- 
ness: The proportion of nonzero dual variables ("sup- 
port vectors") is typically small. Steinwart (2004) 
has presented a beautiful relationship between the 
number of support vectors in an SVM and the Bayes 
risk. Assuming that the kernel is appropriately cho- 
sen and the regularization is reduced sufficiently slowly 
as the sample size increases, the asymptotic propor- 
tion of support vectors is equal to twice the Bayes 
risk. On the other hand, it is known that if we
replace the hinge loss $\phi$ with the differentiable quadratic
loss, sparseness disappears, but then the a posteriori
class probabilities can be estimated asymptotically.
Indeed, sparseness and the ability to estimate
conditional probabilities seem to be incompatible.
If the hinge loss is replaced by any of a large family
of loss functions, it can be shown (Bartlett and
Tewari, 2004) that the proportion of support vectors
approaches the expectation of a certain function of
the conditional probability $\eta(X)$, and this function
is maximal for those values of $\eta(X)$ for which
estimation of the conditional probability is possible
asymptotically.
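The contrast between the hinge loss and the quadratic loss can be seen directly from the minimizers of the conditional $\phi$-risk $\eta\,\phi(\alpha) + (1-\eta)\,\phi(-\alpha)$. The following sketch finds them by grid search; the quadratic loss is taken here to be $\phi(\alpha) = (1-\alpha)^2$, and that form, the grid and the function names are assumptions of this example.

```python
# grid on [-3, 3] over which the conditional risk is minimized
ALPHAS = [i / 100.0 for i in range(-300, 301)]

def conditional_risk(phi, eta, a):
    # E[phi(Y * a) | X = x] when Pr[Y = 1 | X = x] = eta
    return eta * phi(a) + (1.0 - eta) * phi(-a)

def minimizer(phi, eta):
    return min(ALPHAS, key=lambda a: conditional_risk(phi, eta, a))

def hinge(a):
    return max(0.0, 1.0 - a)

def quadratic(a):
    # assumed form of the quadratic loss for this sketch
    return (1.0 - a) ** 2

for eta in (0.2, 0.6, 0.9):
    print(eta, minimizer(hinge, eta), minimizer(quadratic, eta))
```

With the hinge loss, the minimizer is $+1$ for both $\eta = 0.6$ and $\eta = 0.9$, so the two probabilities cannot be told apart from the output. With the quadratic loss the minimizer is $2\eta - 1$, from which $\eta$ can be recovered.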

KERNEL METHODS 

While the SVM has helped to bring RKHS ideas 
to new prominence, focusing on the SVM runs the 
risk of limiting the appreciation of the scope and po- 
tential impact of RKHS methods. In this section we 
augment the presentation by Moguerza and Muñoz
to provide a broader context for the understanding 
of the RKHS aspect of the SVM approach. 

The main point that we wish to make in this sec- 
tion is this: A RKHS provides computationally ef- 
ficient machinery for evaluating and optimizing a 



variety of statistical functionals of interest. The em- 
pirical loss functions of nonparametric classification 
and regression are a special case — one in which the 
focus is on the basis function expansions provided 
by a RKHS — but there are other roles for a RKHS. 

A key idea in RKHS methodology is that an in- 
ner product can be computed by evaluating a repro- 
ducing kernel k(x,x') — a function of two arguments 
that obeys symmetry and positive definiteness con- 
ditions. A reproducing kernel may take the form of 
an analytic function (e.g., the Gaussian kernel) or 
may take the form of a computational procedure 
(e.g., the string kernel, which involves a dynamic 
program).

Given a set of $n$ data points $\{x_1, x_2, \ldots, x_n\}$ and
given a reproducing kernel $k(x, x')$, one can form
the Gram matrix. The SVM and the other kernel
methods mentioned in Section 5 of Moguerza and
Muñoz are all based on various operations
(computation of eigenvectors, determinants, inverses, etc.)
on Gram matrices. This reduction of the data to a
Gram matrix is significant computationally; indeed,
while the naive computational complexity of many
kernel methods is $O(n^3)$, the exploitation of more
sophisticated numerical linear algebra can drive
this cost down to $O(nk^2)$, where $k$ is a measure of
the effective rank of a Gram matrix. The effective
rank is typically small.
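One way to obtain a cost of the form $O(nk^2)$ is a pivoted incomplete Cholesky factorization, which builds a rank-$k$ approximation of the Gram matrix from only $k$ of its columns. The text does not say which technique is meant, so the sketch below is an illustrative assumption, as are the Gaussian bandwidth, the tolerance and the one-dimensional data.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    return math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))

def incomplete_cholesky(points, kernel, tol=1e-6):
    # Pivoted incomplete Cholesky: returns columns g_1..g_k such that
    # K is approximately sum_j g_j g_j^T, stopping once the trace of the
    # residual falls below tol. Only k columns of K are ever evaluated.
    n = len(points)
    diag = [kernel(p, p) for p in points]  # diagonal of the residual
    cols = []
    while sum(diag) > tol and len(cols) < n:
        i = max(range(n), key=lambda j: diag[j])  # pivot on largest residual
        pivot = math.sqrt(diag[i])
        col = []
        for r in range(n):
            resid = kernel(points[r], points[i]) - sum(g[r] * g[i] for g in cols)
            col.append(resid / pivot)
        cols.append(col)
        diag = [max(d - c * c, 0.0) for d, c in zip(diag, col)]
    return cols

points = [i / 10.0 for i in range(50)]
cols = incomplete_cholesky(points, gaussian_kernel)

# largest entrywise error of the low-rank approximation
max_err = max(
    abs(gaussian_kernel(points[r], points[s]) - sum(g[r] * g[s] for g in cols))
    for r in range(len(points))
    for s in range(len(points))
)
print(len(cols), max_err)
```

For a smooth kernel such as the Gaussian, the loop should stop after far fewer than $n$ columns, which is the sense in which the effective rank is small.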

Let us now consider a problem that at first glance
seems to have little to do with kernel methods: the
problem of assessing whether random variables $X_1$
and $X_2$ are independent. Independence can be
reduced to correlation by considering transformations
of the random variables. In particular, $X_1$ and $X_2$
are independent if and only if

$$\rho = \max_{h_1, h_2 \in \mathcal{H}} \operatorname{Corr}(h_1(X_1), h_2(X_2)) = 0$$

for a suitably rich function space $\mathcal{H}$. Indeed, if $\mathcal{H}$ is
$L_2$ and thus contains the Fourier basis, this
statement reduces to a classical fact about characteristic
functions. More interestingly, the result also holds
for certain RKHSs. Moreover, the reproducing
property of the kernel implies that function evaluation
in a RKHS reduces to an inner product,
$h_1(X_1) = \langle k(\cdot, X_1), h_1 \rangle$, where $k(\cdot, \cdot)$ is the reproducing kernel
for $\mathcal{H}$ and $\langle \cdot, \cdot \rangle$ is the corresponding inner product.
Thus correlations can be computed as

$$\operatorname{Corr}(h_1(X_1), h_2(X_2)) = \operatorname{Corr}(\langle k(\cdot, X_1), h_1 \rangle, \langle k(\cdot, X_2), h_2 \rangle).$$

Maximizing over $h_1$ and $h_2$ thus amounts to
maximizing the correlation between projections of
vectors in a pair of Hilbert spaces; this is nothing but
canonical correlation analysis (CCA) in $\mathcal{H}$ (cf.
Leurgans, Moyeed and Silverman, 1993). Moreover,
using the reproducing property it is easy to show that
this functional CCA computation can be reduced to
a generalized eigenvector problem on a pair of Gram
matrices (one Gram matrix for the observations of
$X_1$ and another Gram matrix for the observations of
$X_2$). Thus we can assess independence by carrying
out a kernelized version of CCA.
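The reduction to Gram matrices can be sketched as follows. The notation is ours: $x_j^{(1)}, \ldots, x_j^{(n)}$ denote the observations of $X_j$, $K_1$ and $K_2$ the corresponding Gram matrices, assumed centered in feature space, and $\alpha_1, \alpha_2 \in \mathbb{R}^n$ the expansion coefficients. This is an outline along the lines of Bach and Jordan (2002) as we recall it, not a statement of their exact estimator.

```latex
\begin{align*}
h_j &= \sum_{i=1}^{n} \alpha_{ji}\, k(\cdot, x_j^{(i)})
  \quad\Longrightarrow\quad
  h_j(x_j^{(i)}) = \langle k(\cdot, x_j^{(i)}), h_j \rangle = (K_j \alpha_j)_i,
  \qquad j = 1, 2, \\[4pt]
\hat{\rho} &= \max_{\alpha_1, \alpha_2}
  \frac{\alpha_1^T K_1 K_2\, \alpha_2}
       {\sqrt{(\alpha_1^T K_1^2\, \alpha_1)\,(\alpha_2^T K_2^2\, \alpha_2)}}, \\[4pt]
\begin{pmatrix} 0 & K_1 K_2 \\ K_2 K_1 & 0 \end{pmatrix}
\begin{pmatrix} \alpha_1 \\ \alpha_2 \end{pmatrix}
&= \rho
\begin{pmatrix} K_1^2 & 0 \\ 0 & K_2^2 \end{pmatrix}
\begin{pmatrix} \alpha_1 \\ \alpha_2 \end{pmatrix}.
\end{align*}
```

The first line is the reproducing property applied at the sample points, the second is the empirical correlation written in terms of the coefficients, and the third is the generalized eigenvector problem satisfied by its stationary points. In practice a regularized version is needed, since with invertible Gram matrices the unregularized maximum is always 1.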

This line of argument is due to Bach and Jor- 
dan (2002), who showed how it could be used to 
fit a semiparametric model known as independent 
component analysis. The general approach has been 
carried further by Gretton et al. (2005), who estab- 
lished relationships between RKHS-based measures 
of independence and mutual information. 

Further work in this vein has shown how RKHS
ideas can be used to develop computationally
efficient methods for fitting a wide range of other
nonparametric and semiparametric models. Consider,
for example, the problem of sufficient dimension
reduction (SDR) for regression (Cook, 1998). This
problem can be formulated as the problem of finding
a matrix $B_0$ whose column space spans a subspace
$S$ of the covariate space and which satisfies

$$p_{Y \mid X}(y \mid x) = p_{Y \mid B_0^T X}(y \mid B_0^T x).$$

That is, the regression of $Y$ on $X$ depends only on
the projection of $X$ on $S$. Note that no additional
assumptions are made about the regression function.
We see that SDR can be viewed in terms of an
assertion of conditional independence. Fukumizu, Bach
and Jordan (2006) have shown how such assertions
can be evaluated in terms of covariance operators
on RKHSs. In particular, they showed that $B_0$ can
be alternatively characterized as

$$B_0 = \arg\min_{B} \Sigma_{YY \mid X},$$

where $\Sigma_{YY \mid X}$ is a conditional covariance operator on
a RKHS, based on the kernel function
$k_B(x, x') = k(B^T x, B^T x')$. This operator can be estimated and
minimized (in the sense of the partial order of
self-adjoint operators) using Gram matrices. Under weak
conditions, this yields a consistent procedure for
estimating $B_0$.



CONCLUSIONS 

A statistician who encounters SVMs for the first 
time might have difficulty understanding the source 
of the excitement. After all, the SVM is a modest 
variation on some standard statistical methodology — 
it involves RKHS expansions of discriminant or re- 
gression functions combined with a simple piecewise- 
linear loss function. Nonetheless, this combination 
has noteworthy practical consequences. In particu- 
lar, by paying careful attention to the optimization 
problem that arises in the SVM and by paying care- 
ful attention to the resulting numerical linear alge- 
bra, the SVM can be applied to very large classi- 
fication and regression problems. Moreover, these 
lessons extend beyond the specific setting of the 
SVM. As we have emphasized, the key ideas of con- 
vex relaxation and reproducing kernels have appli- 
cations well beyond the SVM. They permit an ap- 
proach to nonparametric statistics that blends tools 
from nearby areas of applied mathematics such as 
optimization theory, functional analysis, numerical 
linear algebra and combinatorics, undeniably expand- 
ing the scope of activities in nonparametric statis- 
tics and expanding the scale of problems that can 
be addressed. 

REFERENCES 

Arora, S., Babai, L., Stern, J. and Sweedyk, Z. (1997). 
The hardness of approximate optima in lattices, codes, and 
systems of linear equations. J. Comput. System Sci. 54 317- 
331. MR1462727 

Bach, F. R. and Jordan, M. I. (2002). Kernel indepen- 
dent component analysis. J. Mach. Learn. Res. 3 1-48. 
MR1966051 

Bartlett, P. L. (1998). The sample complexity of pattern 
classification with neural networks: The size of the weights 
is more important than the size of the network. IEEE 
Trans. Inform. Theory 44 525-536. MR1607706 

Bartlett, P. L., Bousquet, O. and Mendelson, S.
(2005). Local Rademacher complexities. Ann. Statist. 33
1497-1537. MR2166554

Bartlett, P. L., Jordan, M. I. and McAuliffe, J. D.
(2006). Convexity, classification and risk bounds. J. Amer.
Statist. Assoc. 101 138-156.

Bartlett, P. L. and Shawe-Taylor, J. (1999). Generaliza- 
tion performance of support vector machines and other pat- 
tern classifiers. In Advances in Kernel Methods — Support 
Vector Learning (B. Scholkopf, C. J. C. Burges and A. J. 
Smola, eds.) 43-54. MIT Press, Cambridge, MA. 

Bartlett, P. L. and Tewari, A. (2004). Sparseness versus 
estimating conditional probabilities: Some asymptotic re- 
sults. Learning Theory. Lecture Notes in Comput. Sci. 3120 
564-578. Springer, Berlin. MR2177935 






P. L. BARTLETT, M. I. JORDAN AND J. D. MCAULIFFE 



Blanchard, G., Bousquet, O. and Massart, 
P. (2006). Statistical performance of sup- 
port vector machines. Preprint. Available at 
ida.first.fraunhofer.de/~blanchard/publi/BlaBouMas06_revl 

Borwein, J. M. and Lewis, A. S. (2000). Convex Anal- 
ysis and Nonlinear Optimization. Springer, New York. 
MR1757448 

Boyd, S. and Vandenberghe, L. (2004). Convex Optimiza- 
tion. Cambridge Univ. Press. MR2061575 

Cook, R. D. (1998). Regression Graphics. Wiley, New York.
MR1645673 

Fukumizu, K., Bach, F. R. and Jordan, M. I. (2006). Ker- 
nel dimension reduction for regression. Technical report, 
Dept. Statistics, Univ. California, Berkeley. 

Gretton, A., Herbrich, R., Smola, A., Bousquet, O. 
and Scholkopf, B. (2005). Kernel methods for measuring 
independence. J. Mach. Learn. Res. 6 2075-2129. 

Hiriart-Urruty, J.-B. and Lemarechal, C. (1993). Con- 
vex Analysis and Minimization Algorithms 1, 2. Springer, 
Berlin. MR1261420, MR1295240 

Leurgans, S. E., Moyeed, R. A. and Silverman, B. 
(1993). Canonical correlation analysis when the data are 
curves. J. Roy. Statist. Soc. Ser. B 55 725-740. MR1223939 

Mendelson, S. (2002). Geometric parameters of kernel ma- 
chines. Computational Learning Theory. Lecture Notes in 
Comput. Sci. 2375 29-43. Springer, Berlin. MR2040403 

Schapire, R. E., Freund, Y., Bartlett, P. L. and Lee, 
W. S. (1998). Boosting the margin: A new explanation for 
the effectiveness of voting methods. Ann. Statist. 26 1651- 
1686. MR1673273 



Schapire, R. E. and Singer, Y. (1999). Improved boost- 
ing algorithms using confidence-rated predictions. Machine 
Learning 37 297-336. 

Shawe-Taylor, J., Bartlett, P. L., Williamson, R.
C. and Anthony, M. (1998). Structural risk minimiza- 
tion over data-dependent hierarchies. IEEE Trans. Inform. 
Theory 44 1926-1940. MR1664055 

Steinwart, I. (2002). Support vector machines are univer- 
sally consistent. J. Complexity 18 768-791. MR1928806 

Steinwart, I. (2003). Sparseness of support vector machines. 
J. Mach. Learn. Res. 4 1071-1105. MR2125346 

Steinwart, I. (2004). Sparseness of support vector 
machines — Some asymptotically sharp bounds. In Ad- 
vances in Neural Information Processing Systems (B. 
Scholkopf, L. K. Saul and S. Thrun, eds.) 16 1069-1076. 
MIT Press, Cambridge, MA. 

Steinwart, I. (2005). Consistency of support vector ma- 
chines and other regularized kernel machines. IEEE Trans. 
Inform. Theory 51 128-142. MR2234577 

Williamson, R. C., Smola, A. J. and Scholkopf, B.
(1999). Entropy numbers, operators and support vector 
kernels. In Advances in Kernel Methods — Support Vector 
Learning (B. Scholkopf, C. J. C. Burges and A. J. Smola, 
eds.) 127-144. MIT Press, Cambridge, MA. 

Zhang, T. (2004). Statistical behavior and consistency of 
classification methods based on convex risk minimization. 
Ann. Statist. 32 56-85. MR2051001 

Zhu, J. and Hastie, T. (2005). Kernel logistic regression and 
the import vector machine. J. Comput. Graph. Statist. 14 
185-205. MR2137897 



